NetNews Offline 2

home *** CD-ROM | disk | FTP | other *** search

/ NetNews Offline 2 / NetNews Offline Volume 2.iso / news / comp / lang / c-part1 / 2441 < prev next >

Wrap

Text File | 1996-08-05 | 6.1 KB | 168 lines

Path: pm074-11.dialip.mich.net!user From: nkuitse@umich.edu (Paul M. Hoffman) Newsgroups: comp.lang.c,comp.sys.mac.programmer.tools Subject: Re: HELP ME PLEASE: Thesaurus function / Thesuarus lists needed Date: Sun, 21 Jan 1996 10:57:59 -0500 Organization: University of Michigan Message-ID: <nkuitse-2101961057590001@pm074-11.dialip.mich.net> References: <4dj5ke$uo2@news.clandjop.com> NNTP-Posting-Host: pm074-11.dialip.mich.net In article <4dj5ke$uo2@news.clandjop.com>, eday@clandjop.com wrote: >I need to add a thesaurus function to a program. I need some >help/guidance in starting this project. [snip] >So, here I am, asking: Where can I go, who do I ask, what can I do to >get this project going? I'm doing something similar, using data based on a Project Gutenberg edition of Roget's Thesaurus -- this is the last public-domain edition, from 1911, plus a few modern-day changes. Here's a description from the file: >********************************************************************* >** Thesaurus-1911 ** >** Version 1.02 (supplemented: July 18, 1991) ** >********************************************************************* ><-- An electronic thesaurus derived from the version of Roget's >Thesaurus published in 1911. > This thesaurus has been prepared by MICRA, Inc. (May 1991). [snip] > Note that this version of Thesaurus-1911 has been supplemented >with over 1,000 words not present in the original 1911 edition, but >many modern words are still missing. About 1500 verbs (out of 6500) >which can be found in an 80,000-word spell-checker are absent from >this work. Here's one place to get it (URL from an Archie search in Anarchie): ftp://ftp.uu.net//doc/literary/gutenberg/etext91/roget14a.txt.Z It's 600K or so, 1400K when uncompressed (use MacCompress). It's split up by meaning into about 1000 entries, like its modern-day descendant, Roget's International Thesaurus. Here's an example of (part of) an entry: > #45. [Connecting medium.] Connection. -- N. vinculum, link; >connective, connection; junction &c. 43; bond of union, copula, >hyphen, intermedium[obs3]; bracket; bridge, stepping-stone, isthmus. > bond, tendon, tendril; fiber; cord, cordage; riband, ribbon, >rope, guy, cable, line, halser|, hawser, painter, moorings, wire, >chain; string &c. (filament) 205. I have a cleaned-up version which I can make available if you're interested; it looks like this: >#45. [Connecting medium] Connection ><N> > @vinculum, link; connective, connection; junction <etc> 43; >bond of union, copula, hyphen, intermedium[<obs>]; bracket; bridge, >stepping-stone, isthmus. > @bond, tendon, tendril; fiber; cord, cordage; riband, ribbon, >rope, guy, cable, line, halser[<pipe>], hawser, painter, moorings, >wire, chain; string <etc> (filament) 205. Once you have the data, you need to do some serious Perl (or some such) coding. One (relatively) quick and simple way to make it usable is as follows. Generate an intermediate index by word, looking something like this: ... bond 43 45 79 402 406 768 bondage 43 402 612 ... Presbyterian 509 993 ... (Use Perl associative arrays; it shouldn't be too hard, if perhaps a bit memory-intensive -- I give MacPerl 8M+ for scripts like this. I can e-mail you sample Perl scripts if you like.) Generate a hash table from this. You might use a "dumb" hash table that doesn't resolve collisions: say that "bondage" and "Presbyterian" hash to the same slot, which points to a sequence of entry numbers that is the union of those for "bondage" and "Presbyterian": 43 45 79 402 406 509 768 993 Read all those entries and discard the ones that don't contain the word "bondage". You'll need an index that takes you from the entry number to its byte offset within the file. Again, use Perl to create it: 1 : 0 2 : 409 3 : 745 4 : 1074 Munge this into a Rez resource rescription (excuse my questionable syntax here): resource 'EDAT' 2000 { 0, 409, 745, 1074, ... }; or to an array declaration in C, or whatever. Lastly, put all of this stuff into "chunks" in the data fork of a file: 1 <header, including offsets of chunks 2-5> 2 <hash table proper: each entry is an offset into chunk 3> 3 <entry number sequences referenced by the hash table> 4 <table mapping entry numbers to entry offsets within chunk 5> 5 <the thesaurus data itself> Keep chunks 1, 2, 4 and (if you can) 3 in memory. You might want a way to keep often-used parts of chunk 5 in memory, too, but that complicates things a little. (Hint: use a simple cache, an array of pointers to, say, the 16 most recently used entries. For people who like to "branch" through thesauri, this should make things much quicker.) Et voila` (waving hands) you have a usable thesaurus! Well, more or less. The only really hard part should be the hash table: I've made collision resolution sound easy (by ignoring it), but you will run into problems if you're not careful. Try using a variety of hash functions and pick the one that yields the best results. Here's a pseudo-C skeleton of the look-up algorithm: Boolean LookUp (String word, StringArray results) { /* Return each entry that contains the specified word */ UnsignedInt h, entryNum; Ptr p; String entry; Boolean success = false; h = HashWord (word) mod kHashTableSize; for (each entryNum in slot h of the hash table) { ReadEntryIntoString (entryNum, entry); if (entry contains word) { AddStringToArray (results, entry); success = true; } } return success; } This isn't the most elegant way to do it, but it's simple. (Tempting alternative: use a separate resource for each entry. Don't do it: it may sound good, but it looks like gross Resource Manager abuse to me. A thousand resources is an awful lot.) I love the smell of Perl-generated C code in the morning! Good luck, Paul. -- Paul Hoffman | mailto:paul.hoffman@umich.edu Taubman Medical Lib. | http://www-personal.umich.edu/~nkuitse/ Univ. of Michigan | "Dragons do not enter into this message"